This notebook is intended to accompany my presentation on "Teaching a Computer to Diagnose Cancer: An Introduction to Machine Learning."
Python has a number of excellent libraries available for scientific computing. This notebook uses several of them, including pandas, NumPy, scikit-learn, matplotlib, and seaborn:
In [1]:
import pandas as pd
import numpy as np
import sklearn.model_selection
import sklearn.neighbors
import sklearn.metrics
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
# display plots inline in the notebook
%matplotlib inline
# configure plot readability
sns.set_style("white")
sns.set_context("talk")
We will use the Wisconsin Breast Cancer Diagnostic Data Set (WDBC) to classify breast cancer cells as malignant or benign. This dataset has features computed from digital images of a fine needle aspirate (FNA) of a breast mass. Each feature describes characteristics of the cell nuclei in a particular image. Since column headers are not included in the raw data, we manually encode the feature names.
To simplify our analysis, we will focus on only two features: largest radius and worst texture.
In [2]:
columns_to_features = {1: "Diagnosis",
                       22: "Radius",
                       23: "Texture"}
features_to_keep = columns_to_features.values()
wdbc = pd.read_csv("wdbc.data", header = None) \
    .rename(columns = columns_to_features) \
    .filter(features_to_keep, axis = 1) \
    .replace("M", "Malignant") \
    .replace("B", "Benign")
It's important to gain some insight and intuition about our data. This will help us choose the right statistical model to use.
We'll start by looking at 5 randomly selected observations from the dataset.
Notice that:
In [3]:
wdbc.sample(5)
Out[3]:
Next, let's compare the number of malignant and benign observations by plotting the class distribution.
In [4]:
sns.countplot(x = "Diagnosis", data = wdbc)
sns.despine()
Let's plot the radius and texture for each observation.
In [5]:
sns.regplot(x = wdbc["Radius"], y = wdbc["Texture"], fit_reg = False)
Out[5]:
That's an interesting plot, but does it reveal anything? Let's take a closer look by generating the same plot, but color-coding it by diagnosis.
In [6]:
wdbc_benign = wdbc[wdbc["Diagnosis"] == "Benign"]
wdbc_malignant = wdbc[wdbc["Diagnosis"] == "Malignant"]
sns.regplot(x = wdbc_benign["Radius"], y = wdbc_benign["Texture"], color = "green", fit_reg = False)
sns.regplot(x = wdbc_malignant["Radius"], y = wdbc_malignant["Texture"], color = "red", fit_reg = False)
green_patch = mpatches.Patch(color = "green", label = "Benign")
red_patch = mpatches.Patch(color = "red", label = "Malignant")
plt.legend(handles=[green_patch, red_patch])
Out[6]:
This plot shows that the radius of malignant cells tends to be much larger than the radius of benign cells. As a result, the data appears to "separate" when plotted: the benign cells are clustered together on the left side of the plot, while the malignant cells are clustered together on the right side. You can use this information to intuitively classify new cells.
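We can back this visual impression up with a quick numerical check. Here is a minimal sketch (using the wdbc DataFrame built above) that summarizes the radius and texture values for each diagnosis:

# Summary statistics of Radius and Texture, grouped by diagnosis.
# The larger mean radius for the "Malignant" group matches the separation
# visible in the scatter plot above.
wdbc.groupby("Diagnosis")[["Radius", "Texture"]].describe()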
For example, let's consider classifying the following (radius, texture) pairs:
In [7]:
sns.regplot(x = wdbc_benign["Radius"], y = wdbc_benign["Texture"], color = "green", fit_reg = False)
sns.regplot(x = wdbc_malignant["Radius"], y = wdbc_malignant["Texture"], color = "red", fit_reg = False)
radii = np.array([10, 25, 15])
textures = np.array([20, 30, 25])
sns.regplot(x = radii, y = textures, color = "blue", fit_reg = False,
            marker = "+", scatter_kws = {"s": 800})
green_patch = mpatches.Patch(color = "green", label = "Benign")
red_patch = mpatches.Patch(color = "red", label = "Malignant")
plt.legend(handles=[green_patch, red_patch])
Out[7]:
To classify the cells above, you intuitively applied a classification algorithm called k-nearest neighbors (kNN). Let's learn more about k-nearest neighbors and use it to train a computer to diagnose cells.
The best way to understand the k-nearest neighbors algorithm is to trace its execution. Given a new cell's radius and texture, the algorithm would diagnose it as follows:

1. Compute the distance (for example, the straight-line distance in the radius-texture plane) between the new cell and every cell in the training data.
2. Select the k training cells closest to the new cell: its "k nearest neighbors."
3. Assign the new cell the majority diagnosis among those k neighbors.
Although the kNN algorithm is simple and straightforward, it is extremely powerful. It is often one of the first (and most successful) algorithms that data scientists apply to a new dataset.
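To make the trace above concrete, here is a minimal, hand-rolled sketch of the algorithm. The three training cells and the new cell are made-up values for illustration only; in the next section we use scikit-learn's implementation on the real data.

import collections
import numpy as np

def knn_diagnose(new_point, train_points, train_labels, k = 5):
    # 1. Distance from the new cell to every training cell.
    distances = np.sqrt(((train_points - new_point) ** 2).sum(axis = 1))
    # 2. Indices of the k closest training cells.
    nearest = np.argsort(distances)[:k]
    # 3. Majority vote among those neighbors' diagnoses.
    votes = collections.Counter(train_labels[nearest])
    return votes.most_common(1)[0][0]

# Illustrative (made-up) training cells as (radius, texture) pairs.
train_points = np.array([[10, 20], [24, 28], [11, 19]])
train_labels = np.array(["Benign", "Malignant", "Benign"])
knn_diagnose(np.array([12, 21]), train_points, train_labels, k = 3)  # -> "Benign"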
Let's train a k-nearest neighbors model to diagnose new cell observations.
Since we want to know how well our model generalizes (i.e. how well it can diagnose new observations that it hasn't seen before), we first split our dataset into separate training and test sets. We will train the model with the training set and then analyze its performance with the test set.
In [8]:
x_full = wdbc.drop("Diagnosis", axis = 1)
y_full = wdbc["Diagnosis"]
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(
    x_full, y_full, test_size = 0.3, random_state = 3)
Now we can train a k-nearest neighbors classifier using our training set. This is surprisingly straightforward since we use an implementation provided by the scikit-learn machine learning library.
In [9]:
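# KNeighborsClassifier uses k = 5 neighbors by default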
model = sklearn.neighbors.KNeighborsClassifier().fit(x_train, y_train)
How well does our model work? To learn how well it can diagnose new cells, we will use it to predict the diagnosis of each test set observation. Since we know the true diagnosis of each test set observation, we can calculate the model's accuracy by comparing the predicted and actual diagnoses.
In [10]:
predictions = model.predict(x_test)
sklearn.metrics.accuracy_score(y_test, predictions)
Out[10]:
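Accuracy summarizes the comparison as a single number. To see exactly where the predicted and actual diagnoses agree or disagree, one option is a confusion matrix; here is a minimal sketch, assuming the predictions and y_test defined above:

# Rows correspond to the true diagnoses, columns to the predicted diagnoses
# (labels in alphabetical order: Benign, Malignant).
sklearn.metrics.confusion_matrix(y_test, predictions)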
We explored the Wisconsin Breast Cancer Diagnostic Data Set and found that malignant cells tend to have a larger radius than benign cells. Knowing this, we could intuitively classify new cell observations. We captured this intuition in code by training a k-nearest neighbors model and found that it can correctly predict the diagnosis of a new observation with 95% accuracy.